The ability of deep convolutional neural networks (CNNs) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep convolutional neural network architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.
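To make the idea of audio data augmentation concrete, the following is a minimal sketch of two common waveform-level augmentations, additive background noise at a target SNR and time stretching. The function names, the naive interpolation-based stretch, and the synthetic test tone are illustrative assumptions, not the implementation used in this study:

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    """Mix white noise into `signal` at the given signal-to-noise ratio (dB)."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(signal))
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that sig_power / (scaled noise power) == 10^(snr_db/10).
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

def time_stretch(signal, rate):
    """Naive time stretch by linear-interpolation resampling.

    rate > 1 shortens the clip, rate < 1 lengthens it. This also shifts
    pitch; a real pipeline would use a phase vocoder to stretch time
    while preserving pitch.
    """
    old_idx = np.arange(len(signal))
    new_idx = np.arange(0, len(signal), rate)
    return np.interp(new_idx, old_idx, signal)

# Example: augment a 1-second 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440 * t)
noisy = add_noise(clip, snr_db=10)       # same length, 10 dB SNR
stretched = time_stretch(clip, rate=1.2)  # ~17% shorter clip
```

Each augmented waveform would then be converted to a spectro-temporal representation (e.g. a log-mel spectrogram) and fed to the CNN as an additional labeled training example, multiplying the effective size of the training set.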